[TLE][MTHREADS] Support TLE Structure on mthreads backend #617

Merged
sunnycase merged 11 commits into flagos-ai:triton_v3.6.x from Kylin1207:pr/mthreads/dev_tle_structure on Jun 3, 2026

Conversation

@Kylin1207 Kylin1207 commented May 26, 2026

This patch adds MTHREADS backend support for the main TLE Structure primitives:

  • tle.gpu.memory_space(x, "shared_memory")

    • Marks a ranked tensor for shared-memory materialization.
    • Load inputs can lower through async global-to-shared copy.
    • Non-load tensor inputs materialize through initialized ttg.local_alloc + ttg.local_load.
    • Only "shared_memory" is supported on mthreads; "tensor_memory" and other spaces are rejected.
  • tle.gpu.alloc

    • Supports shared-memory buffers backed by ttg.local_alloc.
    • Supports explicit swizzled shared layouts and initialized allocations.
    • nv_mma_shared_layout=True/default is not supported on mthreads.
    • tmem allocation is not supported.
  • tle.gpu.local_ptr

    • Supports full-view and indexed shared-memory pointers, scalar and tensor indices, 1D/2D use cases, local load/store, masked tails, loops, dot operands, and runtime round trips.
    • Adds automatic barrier insertion for local pointer load-after-store hazards.
    • Adds optimizations that rewrite eligible full-view local pointer loads/stores to memdesc ops.
    • Extends Triton atomic operand type handling for address-space-3 shared-memory pointers.
    • Limitations: indices must be integer typed; scalar/tensor indices cannot be mixed; index rank must match buffer rank; only shared-memory buffers are supported.
  • tle.gpu.copy

    • Supports normal global-memory <-> shared-memory copies using tle.buffered_tensor.
    • Supports descriptor/TME-style copies in both directions: descriptor -> smem and smem -> descriptor.
    • Validates shape, dtype, buffer storage, descriptor offsets, and offset rank.
    • Descriptor copy requires offsets.
    • Normal copy currently requires the pointer tensor shape to exactly match the buffer copy shape.
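The limitations listed above for tle.gpu.local_ptr and tle.gpu.copy can be read as a small set of validation rules. The sketch below models them in plain Python so the rules are easy to check by hand. It is not the frontend code from this PR: `check_local_ptr_indices`, `check_copy`, and the `(dtype, shape)` index encoding are hypothetical names introduced for illustration. Two readings are assumptions: "index rank must match buffer rank" is taken to mean one index per buffer dimension, and "offset rank" is taken to mean one descriptor offset per dimension of the copy shape.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical encoding: an index is (dtype_name, shape).
# shape == () models a scalar index, anything else a tensor index.
Index = Tuple[str, Tuple[int, ...]]

INTEGER_DTYPES = {"int8", "int16", "int32", "int64",
                  "uint8", "uint16", "uint32", "uint64"}


def check_local_ptr_indices(buffer_shape: Sequence[int],
                            indices: Sequence[Index]) -> None:
    """Raise if the indices break the local_ptr limitations from the PR."""
    # Assumed reading of "index rank must match buffer rank".
    if len(indices) != len(buffer_shape):
        raise ValueError("expected one index per buffer dimension")
    # Indices must be integer typed.
    for dtype, _ in indices:
        if dtype not in INTEGER_DTYPES:
            raise TypeError(f"index dtype {dtype!r} is not an integer type")
    # Scalar and tensor indices cannot be mixed.
    if len({shape == () for _, shape in indices}) > 1:
        raise ValueError("scalar and tensor indices cannot be mixed")


def check_copy(copy_shape: Sequence[int],
               pointer_shape: Optional[Sequence[int]] = None,
               offsets: Optional[Sequence[int]] = None,
               use_descriptor: bool = False) -> None:
    """Raise if a copy breaks the tle.gpu.copy constraints from the PR."""
    if use_descriptor:
        # Descriptor copy requires offsets.
        if offsets is None:
            raise ValueError("descriptor copy requires offsets")
        # Assumed reading of the "offset rank" check.
        if len(offsets) != len(copy_shape):
            raise ValueError("offset rank must match the copy shape rank")
    else:
        # Normal copy: pointer tensor shape must exactly match the copy shape.
        if pointer_shape is None or tuple(pointer_shape) != tuple(copy_shape):
            raise ValueError("pointer tensor shape must match the copy shape")
```

For example, `check_local_ptr_indices((64, 32), [("int32", ()), ("int32", (16,))])` raises `ValueError` because it mixes a scalar index with a tensor index, and `check_copy((64, 32), use_descriptor=True)` raises because no offsets were given.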

Lowering path

Key differences from native Triton:

  • tle.gpu.memory_space(..., "shared_memory") is consumed early; no tt.memory_space marker remains after lowering.
  • tle.gpu.local_ptr introduces musa_tle.local_pointers, which is later optimized or lowered away before LLVM IR.
  • Descriptor-based tle.gpu.copy uses ttg.tma_copy as an intermediate but lowers to mthreads/MUSA TME ops such as ttmg.async_tme_copy_global_to_local, ttmg.async_tme_copy_local_to_global, and LLVM MUSA TME intrinsics, instead of the native Triton TMA lowering.
  • Normal tle.gpu.copy lowers through load/store plus local pointer paths, with mthreads-specific async-store optimization.
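The description also mentions automatic barrier insertion for local pointer load-after-store hazards. As a rough illustration of the idea only, here is a toy version over a flat list of operations. It is not the MLIR pass from this PR: the `(kind, buffer)` tuples and `insert_barriers` are hypothetical, and the real pass works on IR with control flow and aliasing that this sketch ignores.

```python
from typing import List, Tuple

# Hypothetical op encoding: (kind, buffer_name),
# where kind is "store", "load", or "barrier".
Op = Tuple[str, str]


def insert_barriers(ops: List[Op]) -> List[Op]:
    """Insert a barrier before any load of a buffer stored since the last barrier."""
    out: List[Op] = []
    dirty = set()  # buffers written since the most recent barrier
    for kind, buf in ops:
        if kind == "barrier":
            # An existing barrier already orders all earlier stores.
            dirty.clear()
        elif kind == "load" and buf in dirty:
            # Load-after-store hazard: synchronize before reading.
            out.append(("barrier", ""))
            dirty.clear()
        elif kind == "store":
            dirty.add(buf)
        out.append((kind, buf))
    return out


ops = [("store", "smem"), ("load", "smem"), ("load", "smem")]
print(insert_barriers(ops))
# [('store', 'smem'), ('barrier', ''), ('load', 'smem'), ('load', 'smem')]
```

Only the first load needs a barrier in this example, because the inserted barrier clears the set of pending stores.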

Performance Data

Benchmark source:

  • python/tutorials/tle/01-fft.py.
  • python/tutorials/tle/03-topk.py.

Note:
For MTHREADS testing, the tutorials currently require manually replacing is_cuda with is_musa before running.
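The manual replacement can be scripted. The sketch below rewrites whole-word occurrences of is_cuda in a source string. `swap_backend_check` is a hypothetical helper name, and it is an assumption that a plain textual substitution is all the tutorials need; check the result before running.

```python
import re


def swap_backend_check(source: str) -> str:
    """Replace whole-word `is_cuda` with `is_musa` in tutorial source text."""
    # \b keeps longer identifiers such as `is_cuda_available` untouched.
    return re.sub(r"\bis_cuda\b", "is_musa", source)


print(swap_backend_check("if is_cuda():"))
# if is_musa():
```

To apply it to a tutorial file, read the file with `pathlib.Path(...).read_text()`, pass the text through `swap_backend_check`, and write it back with `write_text()`.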

Environment:

  • Driver: 4.3.5
  • SDK: 5.1.0
  • Torch: 2.9.0

Baselines and results on large-shape cases:

  • TLE Radix TopK vs Triton TopK: 3.3x speedup.
  • TLE Radix TopK vs Torch TopK: 1.2x speedup.
  • TLE FFT vs Triton FFT: 1.9x speedup.
  • TLE FFT vs Torch FFT: 20x speedup.

@sunnycase sunnycase left a comment

Collaborator

Thanks for the contribution and for adding TLE structure support for the mthreads backend.

Could you please update the PR description or add supporting documentation to explain which TLE primitives are implemented/supported by this work, and include performance benefit data so reviewers can evaluate whether the implementation scope matches the expected value?

It would be helpful to include:

  • The list of implemented TLE primitives, their semantic coverage, and any partial support or known limitations.
  • The lowering/runtime path for each key primitive, especially where it differs from the native Triton path.
  • Performance data: benchmark cases, input sizes, hardware/driver environment, baseline, before/after results, improvement ratio, and any regression cases.
  • If this PR is currently only structural enablement and has no measurable performance gain yet, please state that explicitly and describe the follow-up validation plan.

@zhzhcookie zhzhcookie left a comment

Collaborator

LGTM

@sunnycase sunnycase left a comment

Collaborator

LGTM

@sunnycase sunnycase merged commit fea4914 into flagos-ai:triton_v3.6.x Jun 3, 2026
11 of 13 checks passed